Lesson 5: Data Visualization

Ashir Borah and Natalie Elphick

October 30th, 2024


Base R plotting

  • Base R plots don’t look as good by default

  • It is hard to build more complex plots and to fine-tune them

plot(gapminder$gdpPercap, gapminder$lifeExp)

Grammar of graphics

  • Wilkinson (2005) laid out “Grammar of Graphics”

ggplot2

  • Hadley Wickham implemented the grammar of graphics in the R package ggplot2

What is a statistical graphic?

  • Take variables from a dataset

  • map them to aes()thetic attributes

  • of geom_etric objects

Example with Gapminder data

How are variables mapped to aesthetic attributes of points?

How to use it?

Construct a graphic by adding modular pieces

  • ggplot(data, mapping)

  • Define aesthetic mappings with aes() function

    • e.g. aes(x = var1, y = var2)
  • Add ‘layers’ of geometric objects

    • e.g. geom_point()
  • Adjustments to axis scales, colors, labels, aesthetic mods

  • “Chaining” together ggplot components (use + rather than %>%)

    • + rather than %>% is unfortunate and hard to remember!
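
The modular construction above can be sketched as follows. This is a minimal illustration, assuming only that ggplot2 is installed; it uses R's built-in mtcars data rather than gapminder so that it runs on its own, and the object names (base_plot, scatter, points_drawn) are placeholders.

```r
library(ggplot2)

# Step 1: data plus aesthetic mappings; with no geom yet this draws an empty panel
base_plot <- ggplot(mtcars, mapping = aes(x = wt, y = mpg))

# Step 2: add a layer of geometric objects with `+`
scatter <- base_plot + geom_point()

# Step 3: keep adding pieces (here, labels) with `+`
scatter <- scatter +
  labs(x = 'Weight (1000 lbs)', y = 'Miles per gallon')

# layer_data() returns the data ggplot2 computed for a layer: one row per point
points_drawn <- layer_data(scatter, 1)
nrow(points_drawn)
```

Because every piece is added to an ordinary R object, a partly built plot can be stored in a variable and reused with different layers.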

Using ggplot

The key is to understand the concepts and basic mechanics

The details for any given plot type, or attribute are easy to look up

Let’s try a scatterplot

gap_92 <- gapminder %>% 
  filter(year == 1992) %>% 
  mutate(gdp = gdpPercap * pop / 1e9) 
gap_92 %>% head(4)
# A tibble: 4 × 7
  country     continent  year lifeExp      pop gdpPercap    gdp
  <chr>       <chr>     <int>   <dbl>    <int>     <dbl>  <dbl>
1 Afghanistan Asia       1992    41.7 16317921      649.  10.6 
2 Albania     Europe     1992    71.6  3326498     2497.   8.31
3 Algeria     Africa     1992    67.7 26298373     5023. 132.  
4 Angola      Africa     1992    40.6  8735988     2628.  23.0 

Let’s try a scatterplot

ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) + 
  geom_point()

Scales

ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) + 
  geom_point() +
  scale_x_log10() 

Scales

  • Change how data values are translated to visual properties

    • scale_x_log10(), scale_y_reverse()
  • Change limits of axes:

    • xlim(0, 10)
  • Applies to other attributes as well

    • Fine-tune color, shape, size aesthetics.
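
A small sketch of what a scale does to the data, assuming only ggplot2 and the built-in mtcars data (the names p, p_scaled and scaled_x are illustrative). The scale transforms the values before they are drawn, which can be checked by inspecting the layer data.

```r
library(ggplot2)

p <- ggplot(mtcars, mapping = aes(x = disp, y = mpg)) +
  geom_point()

# Log-transform the x axis and flip the y axis so large values sit at the bottom
p_scaled <- p +
  scale_x_log10() +
  scale_y_reverse()

# The x scale is applied before drawing: x now holds log10(disp)
scaled_x <- layer_data(p_scaled, 1)$x
head(scaled_x)
```

Note that xlim() sets the limits of the x scale, so observations outside the limits are dropped (with a warning) rather than merely hidden.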

Adding more aesthetic mappings (shape)

ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp, shape = continent)) + 
  geom_point() +
  scale_x_log10() 

Adding more aesthetic mappings (color)

ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp, color = continent)) + 
  geom_point() +
  scale_x_log10() 

Labels

The labs() function adds custom axis labels and titles

ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) + 
  geom_point() +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)',
       y = 'Life Expectancy at birth (years)',
       title = 'Gapminder for 1992')

Key geoms

  • Comparing 2 continuous variables

    • Scatterplot: geom_point()

    • Line graph: geom_line()

    • Smoothing functions: geom_smooth()

  • Summarizing distribution of a single variable

    • Histogram: geom_histogram()

    • Density: geom_density()

  • Discrete vs continuous

    • Boxplot: geom_boxplot()

    • Bar graph: geom_col()

    • Violin plot: geom_violin()

And many more…

Comparing 2 continuous variables

geom_line

df <- gapminder %>% 
  filter(country == 'Romania') 
ggplot(df, mapping = aes(x = year, y = lifeExp)) + 
  geom_line()

Layering geoms

We can add as many geoms to a plot as we want, stacked on as ‘layers’ in order

ggplot(df, mapping = aes(x = year, y = lifeExp)) + 
  geom_line() +
  geom_point()

What if we had multiple data points per year?

df <- gapminder %>% 
  filter(country %in% c('Romania', 'Thailand'))
ggplot(df, mapping = aes(x = year, y = lifeExp)) + 
  geom_line() +
  geom_point()

Need to separate them by country (group aesthetic)

ggplot(df, mapping = aes(x = year, y = lifeExp, group = country)) + 
  geom_line() +
  geom_point()

It is often useful to color lines by group. Mapping a categorical variable to the color aesthetic groups the data automatically

ggplot(df, mapping = aes(x = year, y = lifeExp, color = country)) + 
  geom_line() +
  geom_point()
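
The grouping behavior can be checked directly. The sketch below assumes ggplot2 and uses the built-in Orange data (repeated measurements of 5 trees) in place of the gapminder subset; the variable names are illustrative.

```r
library(ggplot2)

# No grouping: geom_line() joins all rows into a single zig-zag path
p_single <- ggplot(Orange, mapping = aes(x = age, y = circumference)) +
  geom_line()

# Mapping a categorical variable to color also splits the data into groups
p_by_tree <- ggplot(Orange, mapping = aes(x = age, y = circumference, color = Tree)) +
  geom_line()

# The `group` column of the layer data shows how many separate lines are drawn
n_groups_single <- length(unique(layer_data(p_single, 1)$group))
n_groups_by_tree <- length(unique(layer_data(p_by_tree, 1)$group))
c(n_groups_single, n_groups_by_tree)
```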

Multiple aesthetic mappings

  • Set the overall mapping in ggplot(); individual geoms can add to it or override it
ggplot(df, mapping = aes(x = year, y = lifeExp)) + 
  geom_line(mapping = aes(color = country)) +
  geom_point()

Multiple aesthetic mappings

  • You can ‘hard-code’ aesthetic properties for each geom
ggplot(df, mapping = aes(x = year, y = lifeExp, color = country)) + 
  geom_line(linetype = 'dashed', linewidth = 0.5) +
  geom_point(color = 'black', size = 3, alpha = 0.75)
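
The difference between a mapped and a hard-coded aesthetic shows up in the layer data. This is a sketch assuming ggplot2 3.4.0 or later (where line width is set with linewidth; older versions used size for lines) and the built-in Orange data; the object names are placeholders.

```r
library(ggplot2)

p <- ggplot(Orange, mapping = aes(x = age, y = circumference, color = Tree)) +
  # fixed line type and width; color still comes from the mapping
  geom_line(linetype = 'dashed', linewidth = 0.5) +
  # color set outside aes() overrides the inherited mapping for this layer only
  geom_point(color = 'black', size = 3, alpha = 0.75)

line_colours <- unique(layer_data(p, 1)$colour)   # one color per tree
point_colours <- unique(layer_data(p, 2)$colour)  # a single fixed color
```

A value placed inside aes() is treated as data to be mapped through a scale, while a value placed outside aes() is used literally.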

Plotting trendlines

How to depict the ‘average’ relationship between noisy variables?

ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) + 
  geom_point() + 
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)') 

Plotting trendlines

geom_line() doesn’t work! It connects the points in order of x instead of summarizing the trend.

ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) + 
  geom_line() +
  geom_point() + 
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)') 

geom_smooth

geom_smooth() shows the average (‘smoothed’) relationship

ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) + 
  geom_point() + 
  geom_smooth() +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)') 

geom_smooth

Can be used to show a linear trendline

ggplot(gap_92, mapping = aes(x = gdp, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = 'lm') +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)') 
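
One detail worth knowing is that the trendline is fit after the scale transformation, so with scale_x_log10() the model is linear in log10(x). The sketch below, assuming ggplot2 and the built-in mtcars data, compares the line drawn by geom_smooth() with a fit done by hand; all object names are illustrative.

```r
library(ggplot2)

p <- ggplot(mtcars, mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
  scale_x_log10()

# Points along the fitted line as computed by geom_smooth (x is on the log10 scale)
trend <- layer_data(p, 2)

# The same fit done by hand with lm() on log10-transformed x
fit <- lm(mpg ~ log10(disp), data = mtcars)
by_hand <- predict(fit, newdata = data.frame(disp = 10^trend$x))

# Largest disagreement between the two lines
max_diff <- max(abs(trend$y - by_hand))
max_diff
```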

geom_smooth to simplify plots

Can be very helpful to condense down relationships from complicated data

ggplot(gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point() +
  scale_x_log10() 

geom_smooth to simplify plots

Can be very helpful to condense down relationships from complicated data

ggplot(gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_smooth(method = 'lm') +
  scale_x_log10() 

Types of plots

  • The examples above were all based on plotting 2 continuous variables (other ‘aesthetics’ can encode additional variables)

  • Other common scenarios are:

    • Plot distribution of a single variable (continuous or discrete)

    • Plot the distribution of a continuous variable against a discrete variable

Plotting a single variable

geom_bar

Given a single discrete variable we can plot its distribution as a ‘bar plot’ using geom_bar()

ggplot(gapminder, mapping = aes(x = continent)) +
  geom_bar()

geom_histogram

For a single continuous variable, we can generate a histogram using geom_histogram(), which bins the values and then makes a bar plot

ggplot(gapminder, mapping = aes(x = gdpPercap)) +
  geom_histogram() 

We can adjust the axis scale and other features as usual

ggplot(gapminder, mapping = aes(x = gdpPercap)) +
  geom_histogram() +
  scale_x_log10()

geom_histogram

We can change the number of bins (can also specify details of bin positions)

ggplot(gapminder, aes(gdpPercap)) +
  geom_histogram(bins = 100) +
  scale_x_log10()
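
Binning can be controlled either by the number of bins or by their width. A minimal sketch, assuming ggplot2 and the built-in faithful data (272 Old Faithful eruptions); the object names are placeholders.

```r
library(ggplot2)

# Ask for a fixed number of bins
p_bins <- ggplot(faithful, mapping = aes(x = waiting)) +
  geom_histogram(bins = 20)

# Or ask for a fixed bin width, in the units of x (minutes here)
p_width <- ggplot(faithful, mapping = aes(x = waiting)) +
  geom_histogram(binwidth = 5)

# Each row of the layer data is one bar; `count` is its height
bars_bins <- layer_data(p_bins, 1)
bars_width <- layer_data(p_width, 1)

# Either way, every observation lands in exactly one bar
total_bins <- sum(bars_bins$count)
total_width <- sum(bars_width$count)
```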

geom_histogram

Can also encode different continents in different colors by stacking the histograms

ggplot(gapminder, mapping = aes(x = gdpPercap, color = continent)) +
  geom_histogram() +
  scale_x_log10()

fill vs color

For bars, color sets the outline and fill sets the interior, so fill is usually the one to map.

ggplot(gapminder, mapping = aes(x = gdpPercap, fill = continent)) +
  geom_histogram() +
  scale_x_log10()

geom_density

Density plots are another way to depict the distribution of a continuous variable. They are essentially smoothed histograms

ggplot(gapminder, mapping = aes(x = gdpPercap)) +
  geom_density() +
  scale_x_log10()

geom_density

Separate by continent and give separate fill colors

ggplot(gapminder, mapping = aes(x = gdpPercap, fill = continent)) +
  geom_density(alpha = 0.5) +
  scale_x_log10()

1 continuous var vs 1 discrete

geom_boxplot

The boxplot is the most common choice for showing the distribution of a continuous variable broken down by a categorical variable

ggplot(gapminder, mapping = aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10()

geom_violin

The violin plot is similar, but shows the distribution as a density plot, rather than a box.

ggplot(gapminder, mapping = aes(x = continent, y = gdpPercap)) +
  geom_violin() +
  scale_y_log10()

geom_beeswarm

Another useful option is a ‘dotplot’ or ‘beeswarm’ plot.

library(ggbeeswarm)
ggplot(gapminder, mapping = aes(x = continent, y = gdpPercap)) +
  geom_beeswarm(size = 0.5, alpha = 0.75, cex = 1) +
  scale_y_log10()

What if I want to control the order?

  • By default, discrete x-axis values are ordered alphabetically

  • Need to use the idea of a factor

  • Factors used to encode categorical variables, specify the possible ‘levels’, and optionally an ordering

cont_order <- c('Oceania', 'Europe', 'Americas', 'Asia', 'Africa')
gap_cat <- gapminder %>% 
  mutate(continent = factor(continent, levels = cont_order))
head(gap_cat)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <chr>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
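
How factor levels behave can be seen in base R alone. The sketch below uses a short made-up vector standing in for a continent column; the values and variable names are illustrative only.

```r
# Hypothetical values standing in for a continent column
continent <- c('Asia', 'Europe', 'Africa', 'Europe', 'Oceania')

# Without explicit levels, factor() sorts the levels alphabetically
default_fct <- factor(continent)
levels(default_fct)

# Supplying levels fixes the order that plots (and tables) will use
cont_order <- c('Oceania', 'Europe', 'Asia', 'Africa')
ordered_fct <- factor(continent, levels = cont_order)
levels(ordered_fct)

# Values missing from `levels` silently become NA, so check the spelling
typo_fct <- factor(continent, levels = c('Oceania', 'Europe', 'Asia'))
sum(is.na(typo_fct))
```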

What if I want to control the order?

ggplot(gap_cat, mapping = aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10()

What if I want to control the order?

The forcats package has lots of useful helper functions for changing the order of factor variables.

gap_cat <- gap_cat %>% 
  mutate(continent = fct_reorder(continent, gdpPercap, median))
ggplot(gap_cat, mapping = aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10()
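
To see what fct_reorder() does to the levels, here is a small sketch on made-up data, assuming the forcats package is installed; the data frame toy and its columns are hypothetical.

```r
library(forcats)

# Hypothetical data: three groups with two values each
toy <- data.frame(
  group = c('a', 'a', 'b', 'b', 'c', 'c'),
  value = c(10, 12, 1, 3, 5, 7)
)

# Reorder the levels of `group` by the median of `value` within each group
# (medians are a = 11, b = 2, c = 6, and the default order is ascending)
reordered <- fct_reorder(toy$group, toy$value, .fun = median)
levels(reordered)

# .desc = TRUE puts the largest median first
reordered_desc <- fct_reorder(toy$group, toy$value, .fun = median, .desc = TRUE)
levels(reordered_desc)
```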

geom_col

If you want to plot a single value of a continuous variable for each level of a discrete variable, use geom_col

gap_82 <- gapminder %>% 
  filter(year == 1982, continent == 'Americas')

ggplot(gap_82, mapping = aes(x = country, y = gdpPercap)) + 
  geom_col()

theme

  • You can customize MANY details of the plot using the theme function

  • It’s a bit complicated at first, but most common changes are easy to google.
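
A few of the most commonly requested theme tweaks are sketched below, assuming ggplot2 and the built-in mtcars data. The particular settings are only examples, not recommendations from these slides.

```r
library(ggplot2)

p <- ggplot(mtcars, mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(color = 'Cylinders', title = 'Fuel economy vs weight')

p_themed <- p +
  theme_minimal() +                                     # start from a complete theme
  theme(
    legend.position = 'bottom',                         # move the legend
    axis.text.x = element_text(angle = 45, hjust = 1),  # rotate x tick labels
    plot.title = element_text(face = 'bold')            # bold title
  )

# Building the plot checks that all theme settings are valid
built <- ggplot_build(p_themed)
```

Rotating the x tick labels is particularly handy for plots with many long category names on the x axis.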

Saving your plots

  • ggsave
ggplot(gapminder, mapping = aes(x = continent, y = gdpPercap)) +
  geom_violin() +
  scale_y_log10()
ggsave(filename = here::here('results', 'my_fig.png'))
  • Using the RStudio GUI
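
When no plot is given, ggsave() saves the last plot that was displayed. A slightly more explicit sketch is shown below, assuming ggplot2 and the built-in mtcars data; it writes to a temporary file (a stand-in for a real path such as one built with here::here()) and passes the plot and its size explicitly.

```r
library(ggplot2)

violin_plot <- ggplot(mtcars, mapping = aes(x = factor(cyl), y = mpg)) +
  geom_violin()

# Hypothetical output location; a temporary file keeps the sketch self-contained.
# ggsave() picks the graphics device from the file extension (PDF here).
out_file <- tempfile(fileext = '.pdf')

# Pass the plot explicitly and fix the size so the figure is reproducible
ggsave(filename = out_file, plot = violin_plot,
       width = 6, height = 4, units = 'in')

file.exists(out_file)
```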

Key practical tips

  • You don’t need to remember the details, just the basic mechanics. You can quickly look up the details (check out this useful ggplot cheat sheet)

  • Find example plots online that you like and just copy/paste as a template. Browse the ggplot gallery

Additional Resources/References

Additional material

Some notes on using color

If we map a continuous variable to color it won’t group automatically

ggplot(df, mapping = aes(x = year, y = lifeExp, color = gdpPercap)) +
  geom_line() +
  geom_point(size = 3)

Some notes on using color

We need to specify group manually

ggplot(df, mapping = aes(x = year, y = lifeExp,
                         group = country, color = gdpPercap)) +
  geom_line() +
  geom_point(size = 3)

Some notes on using color

  • ggplot2 assumes a continuous color scale for numeric data and a discrete scale for strings and factors

  • Make numeric data into factors if you want discrete colors

my_df <- gapminder %>%
  filter(year %in% c(1957, 1977, 1997))
ggplot(my_df, mapping = aes(x = gdpPercap, y = lifeExp, color = factor(year))) +
  geom_point() +
  scale_x_log10() +
  labs(color = 'year')

Color palettes

We can use scale_color_manual to set the color of each group manually

my_cols <- c(Romania = 'green', Thailand = 'orange')

ggplot(df, mapping = aes(x = year, y = lifeExp, color = country)) +
  geom_line() +
  scale_color_manual(values = my_cols)

scale_color_brewer offers some useful default color schemes

ggplot(df, mapping = aes(x = year, y = lifeExp, color = country)) +
  geom_line() +
  scale_color_brewer(palette = 'Dark2')

RColorBrewer

https://www.r-bloggers.com/a-detailed-guide-to-ggplot-colors/

Facets

Facets allow you to easily break a single plot into multiple plots based on a variable.

gap_early <- gapminder %>%
  filter(year < 1970)
ggplot(gap_early, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  scale_x_log10() +
  facet_wrap(~continent)

Or based on multiple variables

ggplot(gap_early, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  scale_x_log10() +
  facet_grid(year ~ continent)
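
The two faceting functions differ in how panels are laid out. A minimal sketch, assuming ggplot2 and the built-in mtcars data (cyl has 3 values and am has 2); the object names are placeholders.

```r
library(ggplot2)

p <- ggplot(mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

# facet_wrap(): one panel per value of a variable, wrapped into rows as needed
p_wrap <- p + facet_wrap(~cyl)

# facet_grid(): rows from one variable, columns from another
p_grid <- p + facet_grid(am ~ cyl)

# The PANEL column of the layer data records which panel each point is drawn in
n_wrap <- length(unique(layer_data(p_wrap, 1)$PANEL))
n_grid <- length(unique(layer_data(p_grid, 1)$PANEL))
c(n_wrap, n_grid)
```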

geom_text

gap_df <- gapminder %>%
  filter(year == 1992, continent == 'Americas') %>%
  mutate(gdp = gdpPercap * pop / 1e9) %>%
  head(20)

You can add text labels to the points with geom_text

ggplot(gap_df, mapping = aes(x = gdp, y = lifeExp, label = country)) +
  geom_text() +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)')

Or with geom_label

ggplot(gap_df, mapping = aes(x = gdp, y = lifeExp, label = country)) +
  geom_label() +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)')

ggrepel

  • Text labels are often not placed optimally

  • ggrepel is a very useful package that will automatically find good positioning for labels

library(ggrepel)

ggplot(gap_df, mapping = aes(x = gdp, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)') +
  geom_label_repel(aes(label = country), size = 2.5)

Beautification

There are lots of ways to add aesthetic improvements to your figures relatively easily

There are a number of pre-packaged ‘themes’ you can apply

my_plot + theme_minimal()

Tip for making nice scatterplots

Set the marker shape to one that can be ‘filled’ (pch = 21 is a filled circle), then use a thin white border around a filled shape to help distinguish overlaps.

ggplot(gap_92, aes(gdp, lifeExp)) + 
  geom_point(pch = 21, stroke = 0.5, alpha = 0.8, size = 2.5, color = 'white', aes(fill = continent)) + 
  scale_x_log10() +
  labs(x = 'Gross Domestic Product (Billions $)', y = 'Life Expectancy at birth (years)', title = 'Gapminder for 1992') +
  theme_minimal()

ggpubr

Add stats directly to your figures

library(ggpubr)
my_comparisons <- list( c("Africa", "Asia"), c('Europe', 'Oceania'))
ggplot(gapminder, mapping = aes(x = continent, y = gdpPercap)) +
  geom_violin() +
  scale_y_log10() +
  stat_compare_means(method = 'wilcox.test', comparisons = my_comparisons)

ggpubr

Easily add correlation coefficients

ggplot(gap_92, mapping = aes(x = lifeExp, y = gdpPercap)) +
  geom_point() +
  scale_y_log10() +
  geom_smooth(method = 'lm') +
  stat_cor()

cowplot

Great tool for combining multiple ‘panels’ into one plot

library(cowplot)

p1 <- ggplot(mtcars, aes(disp, mpg)) + 
  geom_point()
p2 <- ggplot(mtcars, aes(qsec, mpg)) +
  geom_point()
plot_grid(p1, p2, labels = c('A', 'B'))

ComplexHeatmap

  • ggplot2 struggles to make large heatmaps (geom_tile); for these, ComplexHeatmap is the preferred tool

  • See VERY detailed documentation with examples here

  • Also contains useful information on the basics of hierarchical clustering

Recap

  • Define aesthetic mappings with aes() function
  • Add layers using + rather than %>%
  • Adjust axis scales, colors, and labels with scale_*() and labs() layers, and other visual details with a theme() layer
  • Add stats to your figure using a stat_ layer